Improving Phrase Extraction via MBR Phrase Scoring and Pruning

Authors

  • Nan Duan
  • Mu Li
  • Ming Zhou
  • Lei Cui
Abstract

One of the major reasons for translation errors in phrase-based SMT systems is incorrect phrases induced from inaccurately word-aligned parallel data. In this paper, we propose a novel approach that uses the minimum Bayes-risk (MBR) principle to improve the accuracy of phrase extraction. Our approach operates as a four-stage pipeline: first, bilingual phrases are extracted from a parallel corpus using a standard phrase induction method; then, phrases are separated into groups under specific constraints and scored using an MBR model; next, word alignment links contained in phrases whose MBR scores fall below a certain threshold are pruned from the parallel data; finally, a new phrase table is learned from the link-pruned parallel data and used in SMT decoding. We evaluate our approach on the NIST Chinese-to-English MT tasks and show significant improvements on parallel data sets of different scales.
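
The scoring and pruning stages of this pipeline can be illustrated with a minimal Python sketch. The abstract does not specify the grouping constraints, the MBR gain function, or the threshold, so everything below is an assumption made for illustration: phrases are grouped by source phrase, the gain is a Jaccard word overlap between target phrases, the posterior is a relative frequency, and the names `similarity`, `mbr_scores`, and `prune_links` are hypothetical. It is not the authors' implementation.

```python
from collections import defaultdict


def similarity(a, b):
    """Jaccard word overlap between two target phrases (an assumed gain function)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def mbr_scores(phrase_counts):
    """Score each (source, target) pair by its expected gain within its group.

    phrase_counts maps (source, target) to an extraction count. Pairs are
    grouped by source phrase (an assumed grouping constraint), and relative
    frequency within the group stands in for the posterior.
    """
    groups = defaultdict(dict)
    for (src, tgt), count in phrase_counts.items():
        groups[src][tgt] = count
    scores = {}
    for src, targets in groups.items():
        total = sum(targets.values())
        for tgt in targets:
            scores[(src, tgt)] = sum(
                similarity(tgt, other) * count / total
                for other, count in targets.items()
            )
    return scores


def prune_links(links, phrase_spans, scores, threshold):
    """Drop alignment links that fall inside low-scoring phrases.

    links: set of (source_index, target_index) pairs for one sentence pair.
    phrase_spans: list of ((source, target), (i1, i2, j1, j2)) with inclusive
    source span i1..i2 and target span j1..j2.
    Phrases with no score are kept (assumed default).
    """
    pruned = set(links)
    for pair, (i1, i2, j1, j2) in phrase_spans:
        if scores.get(pair, 1.0) < threshold:
            pruned -= {
                (i, j) for (i, j) in links if i1 <= i <= i2 and j1 <= j <= j2
            }
    return pruned


if __name__ == "__main__":
    # Toy counts, invented for this example.
    counts = {("mao", "the cat"): 8, ("mao", "cat"): 3, ("mao", "dog ran"): 1}
    scores = mbr_scores(counts)
    links = {(0, 0), (0, 1), (1, 2)}
    spans = [(("mao", "the cat"), (0, 0, 0, 1)), (("mao", "dog ran"), (1, 1, 2, 2))]
    print(prune_links(links, spans, scores, threshold=0.3))
```

In this sketch a link is removed whenever any low-scoring phrase covers it, even if a high-scoring phrase covers it too. A new phrase table would then be extracted from the pruned alignments with the same standard phrase induction method as in the first stage.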

Similar resources

Estimating Phrase Pair Relevance for Translation Model Pruning

We present pruning strategies for translation models that are based on estimating the relevance of phrase pairs. We apply the overall translation system to a set of data and collect a number of statistics for each phrase pair. Using these statistics in various scoring terms we are able to significantly outperform baseline pruning methods and we can show that the number of phrase pairs can be re...

Improving Relative-Entropy Pruning using Statistical Significance

Relative entropy-based pruning has been shown to be efficient for pruning language models for more than a decade. Recently, this method has been applied to phrase-based machine translation, and results suggest that it is comparable to the state-of-the-art pruning method based on significance tests. In this work, we show that these two methods are effective in pruning different types of phra...

Phrase Table Pruning via Submodular Function Maximization

Phrase table pruning is the act of removing phrase pairs from a phrase table to make it smaller, ideally removing the least useful phrases first. We propose a phrase table pruning method that formulates the task as a submodular function maximization problem, and solves it by using a greedy heuristic algorithm. The proposed method can scale with input size and long phrases, and experiments show ...

Improving Phrase-Based Machine Translation

Current state-of-the-art machine translation systems use a phrase-based scoring model for choosing among candidate translations in a target language, typically English. These models are deemed phrase-based because candidate sentence scores are in large part a product of phrase translation probabilities. These translation probabilities must be learned in some unsupervised manner from a pair of s...

Translation Model Pruning via Usage Statistics for Statistical Machine Translation

We describe a new pruning approach to remove phrase pairs from translation models of statistical machine translation systems. The approach applies the original translation system to a large amount of text and calculates usage statistics for the phrase pairs. Using these statistics the relevance of each phrase pair can be estimated. The approach is tested against a strong baseline based on previ...

Publication date: 2011